In this lab we will use a data set with the number of species of ants recorded in vegetation patches of different sizes in California (Suarez et al. 2008). We will use this data to fit different models for the species-area relationship.
Load the required libraries and the data, which is available in the supplementary material of Ohyama et al. (2021):
One of the first models proposed to describe the relationship between the area of an island or of a habitat patch (\(A\)) and the number of species that this area harbors (\(S\)) is the power function:
\[ S \, = \, \alpha A^{\beta_1}\]
Taking the logarithm of both sides of this equation we have a linear function:
\[ \ln S \, = \, \beta_0 + \beta_1 \ln A \]
Where \(\beta_0= \ln \alpha\). Thus, we can evaluate if this model approximates the observed relationship with a simple linear regression:
\[ \begin{align} \ln S_i &\sim \text{Nornal}(\mu_i, \sigma) \\ \mu_i & = \beta_{0} + \beta_1 \ln A_i \end{align} \]
In fact, the species-area relationship looks linear in log-scale:
plot(Species.Richness ~ Island_area, data= suarez,
xlab = "patch area (m2)", ylab = "Number of species", log="xy")
We will use the maximum likelihood method (ML) to fit the linear regression model to data, and then we will compare the results with the least squares (LS) fit.
To perform ML fitting, we will use a computer optimization routine to find the values of parameters \(\beta_0\), \(\beta_1\) and \(\sigma\) associated with the minimum value of the negative log-likelihood function \(L_y\). We start creating a function in R that calculates this function for the data:
The command above creates an object of the class function with the
three parameters of the linear model as arguments. The first line of
the function calculates the expected value \(\mu\), using any value of
\({\beta_0,\beta_1}\) you supply to the function. The second line
calculates the value of the normal density function for each value of
\(ln S\), takes the logarithm of these densities, and sum all them
up. For instance, the negative log-likelihood of this model for this
data using \(\beta_0 = 1, \beta_1 = 1\) and \(\sigma = 1\) is
LL1(beta0 = 1, beta1 = 1, sigma =1)
[1] 2432.565
Now we call an optimization routine to minimize the function defined
above. We will use the front-end function mle2, from the package
bbmle2. The R function mle2 was suited to the specific problem of
computational minimization of negative log-likelihood functions. You
can find more details in the help page of mle2, and also in the
vignette of bbmle package.
Run the optimization and store the fit in an object called ml1:
This object has the usual methods for classes of statistical models fitted in R. For example, you get a summary of the fit with:
summary(ml1)
The estimated values of the parameters are the maximum likelihood estimates (MLE). Store them in a separate object:
ml1.cf <- coef(ml1)
And then you can use these estimates to plot the fitted regression line :
How precise are the MLEs we got from this fit? We can check this with
likelihood profiles. For models fitted with the function mle2,
the package
sads has
the function plotprofmle, to plot these profiles:
ml1.p <- profile(ml1, alpha =0.05) ## calculate the profile
## A graphic window for 3 plots
par(mfrow = c(1,3))
## The plots
plotprofmle(ml1.p, ylim = c(0,3))
The plot shows the negative log-likelihood re-rescaled to have a
minimum at zero. The blue line marks the limit of \(L_y(\theta) = 2\). It
is a popular rule that parameter values that result in a likelihood
value below this threshold are equally plausible. With this we define
a plausible interval for the MLEs. The function
likelregions returns the limits of the plausible interval:
likelregions(ml1.p)
The parameter of theoretical importance is the power coefficient \(\beta_1\). The plausible interval is similar to the range of recorded values of this coefficient for other mainland systems of patches (see for example Fig. 2 in Ohyama et al 2021).
The least-square methods (LS) is the maximum likelihood solution for a
simple linear Gaussian regression. You can check this by fitting the
same model with the function lm
and then comparing the coefficients of both fits:
(ls1.cf <- coef(ls1)) # LS
ml1.cf[1:2]
In simple cases like that, the limits of the frequentist confidence interval also fall close those of the plausibility interval:
confint(ls1)
likelregions(ml1.p)
There is a systematic difference, though. The MLE of the \(\sigma\) parameter is the mean of the squared residuals:
\[\hat{\sigma}=\sqrt{\frac{\sum(y_i-\hat{y})^2}{n}}\]
You can check this:
The LS method estimates \(\sigma\) with a bias correction for small samples (\(n-2\)):
\[\hat{\sigma}=\sqrt{\frac{\sum(y_i-\hat{y})^2}{n-2}}\]
Which you can also check:
There are dozens of alternative models to describe the species-area relationship (e.g. Tjørve 2009, Dengler, 2009). We will fit two of them and then use the Akaike Information Criterion (AIC) to check which one is better supported by the data.
Additionally to the power function that we have fitted, we will also try the Gleason’s logarithmic function:
\[S = \beta_0 + \beta_1 \ln A\]
And Kobayashi’s (1975) model:
\[S = \beta_0 \log \left(1 + \frac{A}{\beta_1}\right)\]
The response in these models is \(S\), so to make the model comparison
we will refit the power model to \(S\) (instead \(\ln S\) as we did
above). We will keep the Gaussian distribution to describe the
variation around the expected values. Thus, the power and the
Kobayashi models turn to be the expected values for nonlinear
Gaussian regressions, that we fit with the function nls:
This function performs better with good starting values. For these fits, we have some nice guesses to start, from the linear fit.
The logarithmic model is actually a linear function to \(\ln A\). So, we
could fit this model with the lm function. But let’s do that again
using optimization:
Let’s check the curves of the expected values by these three models:
## coefficients
nls1.cf <- coef(nls1)
nls2.cf <- coef(nls2)
ml2.cf <- coef(ml2)
plot(Species.Richness ~ Island_area,
data = suarez, xlab = "Patch area (m2)",
ylab = "Number of species")
curve(nls1.cf[1] * x^nls1.cf[2], add=TRUE, col=1)
curve(ml2.cf[1] + ml2.cf[2]*log(x), add=TRUE, col=2)
curve(nls2.cf[1] * log(1+ x/nls2.cf[2]), add=TRUE, col =3)
legend("topleft", c("Power", "Logarithmic", "Kobayashi"),
lwd=1, col = 1:3, bty="n")
It is not easy to guess which model performed better. Use the function
AICctab to get a model selection table:
Using our canonical rule, models with \(\Delta \text{AIC} \leq 2\) are equally plausible. Therefore, we can say that this data rules out only the logarithmic model.
AIC spots the models that best approximate the unknown process that generates our data. So, models best ranked by AIC should provide best predictions from future samples.
Of course we could check this if we take another sample, and see if the selected model provides good predictions of the response from the new measurements of the predictor variables in this new sample.
Cross-validation is a class of statistical methods to approximate this situation with only the sample at hand. The general idea is to split your data in a training set, and a testing set. We fit the competing models using the training set and then we use the fitted model to predict the response variable in the testing set. The goodness of fit is usually measured by the mean square error (MSE), which is the average of the squared residuals. Models with small MSEs are those with higher predictive power.
In the Leave-one-out cross-validation, the testing set are single observations taken once from the data set. Therefore, the training set is the entire data set minus the observation taken for the testing set. The procedure is repeated by leaving each single observation out.
There are many libraries for all flavors of cross-validation in R. But to illustrate we will run a leave-one-out cross validation with a loop that shows each step in a (hopefully) easy to read code:
## A matrix to store results
results <- matrix(NA, ncol =3, nrow = nrow(suarez) ,
dimnames=list(NULL, c("Power", "Logarithmic", "Kobayashi")))
## A loop to run all fits leaving one observation out each time
for(i in 1:nrow(suarez)) {
## Leave one observation out
data <- suarez[-i,]
## fit the models for each data with an observation left out
m1 <- nls(Species.Richness ~ beta0*Island_area^beta1,
data = data,
start = list(beta0 = nls1.cf[1], beta1 = nls1.cf[2]))
m2 <- lm(Species.Richness ~ log(Island_area), data = data)
m3 <- nls(Species.Richness ~ beta0*log(1 + Island_area/beta1),
data = data,
start = list(beta0 = nls2.cf[1], beta1 = nls2.cf[2]))
## Predicted value for the observation left out
m1.pred <- predict(m1, newdata = suarez[i,])
m2.pred <- predict(m2, newdata = suarez[i,])
m3.pred <- predict(m3, newdata = suarez[i,])
## Residuals
results[i,] <- suarez$Species.Richness[i] - c(m1.pred, m2.pred, m3.pred)
}
And here we calculate the MSE:
As expected, we get a result pretty similar to the model selection with AIC.
Dengler, J. (2009). Which function describes the species–area relationship best? A review and empirical evaluation. Journal of Biogeography 36, 728–744.
Kobayashi, S. (1975). The species-area relation II. A second model for continuous sampling. Researches on Population Ecology, 16(2), 265-280.
Ohyama, L., Holt, R. D., Matthews, T. J., & Lucky, A. (2021). The species–area relationship in ant ecology. Journal of Biogeography, 48(8), 1824-1841.
Suarez, A. V., Bolger, D. T., & Case, T. J. (1998). Effects of fragmentation and invasion on native ant communities in coastal southern California. Ecology, 79(6), 2041-2056.
Tjørve, E. (2009). Shapes and functions of species–area curves (II): a review of new models and parameterizations. Journal of Biogeography 36: 1435-1445.